Description and between-group comparisons

Biostatistics Support and Research Unit

Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain

March 26, 2025

Index

1. Introduction

2. Descriptive tables

3. gtsummary

4. Descriptive graphics

5. ggplot2

6. Comparison between groups

     6.1 Independent samples

     6.2 Paired samples

     6.3 Effect size

Introduction

Descriptive statistics

Techniques used to summarize categorical and numerical data of a sample or population in a meaningful way, both numerically and graphically.

  • It is an essential first step in any research project, before any inference is attempted.

  • Depending on the scope of a project, descriptive statistics alone may be sufficient, or they can be used to prepare further statistical analyses.

  • Descriptive statistics can describe a single variable (univariate) or the relationship between two or more variables (bivariate/multivariate).

  • The variable type (numerical/categorical) determines which types of descriptive statistics we should use.

trial dataset

  • We will use the trial dataset from the {gtsummary} package:
library(gtsummary)
trial
trt age marker stage grade response death ttdeath
Drug A 23 0.160 T1 II 0 0 24.00
Drug B 9 1.107 T2 I 1 0 24.00
Drug A 31 0.277 T1 II 0 0 24.00
Drug A NA 2.067 T3 III 1 1 17.64
Drug A 51 2.767 T4 III 1 1 16.43

Descriptive tables

Categorical variables

  • Categorical variables (nominal or ordinal) are usually described by the number of subjects in each category (absolute frequency) together with the corresponding percentage (relative frequency):

Characteristic                    N = 200
Chemotherapy Treatment, n (%)
    Drug A                        98 (49%)
    Drug B                        102 (51%)
T Stage, n (%)
    T1                            53 (27%)
    T2                            54 (27%)
    T3                            43 (22%)
    T4                            50 (25%)
Grade, n (%)
    I                             68 (34%)
    II                            68 (34%)
    III                           64 (32%)
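
As a minimal sketch of how such counts and percentages can be obtained in base R, the example below uses table() and prop.table(). The vector of grades is made up for illustration and is not the trial data.

```r
# Hypothetical tumour grades for 10 patients (illustrative values only)
grade <- factor(
  c("I", "II", "II", "III", "I", "II", "III", "I", "II", "II"),
  levels = c("I", "II", "III")
)

n   <- table(grade)           # absolute frequencies
pct <- prop.table(n) * 100    # relative frequencies, in percent

# Combine both in one small summary table
data.frame(grade = names(n), n = as.vector(n), percent = as.vector(pct))
```

Declaring the levels explicitly keeps the categories in their natural order, even if one of them had no observations.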

  • The same summary can be stratified by a grouping variable, here the treatment arm:

Characteristic       Drug A, N = 98     Drug B, N = 102
T Stage, n (%)
    T1               28 (29%)           25 (25%)
    T2               25 (26%)           29 (28%)
    T3               22 (22%)           21 (21%)
    T4               23 (23%)           27 (26%)
Grade, n (%)
    I                35 (36%)           33 (32%)
    II               32 (33%)           36 (35%)
    III              31 (32%)           33 (32%)
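
A stratified table like this one reports column percentages, so each treatment arm adds up to 100%. As a hedged sketch in base R, with two short made-up vectors rather than the trial data:

```r
# Hypothetical treatment arm and T stage for 10 patients (illustrative only)
trt   <- c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
stage <- c("T1", "T2", "T1", "T3", "T1", "T2", "T2", "T3", "T3", "T3")

tab <- table(stage, trt)                      # counts: stage in rows, arm in columns
col_pct <- prop.table(tab, margin = 2) * 100  # percentages within each arm

tab
round(col_pct, 1)
```

Using margin = 1 instead would give row percentages, that is, the split between arms within each stage.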

Numerical variables

  • Numerical variables (continuous or discrete) are described by a central tendency statistic to know which value is the most “typical”, together with a measure of how spread out observations are around this value (variability).

  • The three most common measures of central tendency are:

    Mean: average of all observations.

\[ \bar{x} = \frac{\sum x_i}{n} \]

    Median: middle observation when sorted in order from least to greatest.

    Mode: value that appears most often.
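
As a small worked example of the three definitions, the sketch below computes them by hand on a made-up vector. Base R has no function for the statistical mode, so it is derived here from a frequency table.

```r
# Small illustrative vector (not the trial data)
x <- c(2, 3, 3, 5, 7, 10)

mean_x   <- sum(x) / length(x)   # same result as mean(x)
median_x <- median(x)            # even n: average of the two middle values

# Mode: the value(s) with the highest frequency
counts <- table(x)
mode_x <- as.numeric(names(counts)[counts == max(counts)])

c(mean = mean_x, median = median_x, mode = mode_x)
```

If several values share the highest frequency, mode_x contains all of them.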

Numerical variables

trial$age
  [1] 23  9 31 NA 51 39 37 32 31 34 42 63 54 21 48 71 38 49 57 46 47 52 61 38 34
 [26] 49 63 67 68 78 36 37 53 36 51 48 57 31 37 28 40 49 61 56 54 71 38 31 48 NA
 [51] 83 52 32 53 69 60 45 39 NA 38 36 71 31 43 57 53 25 44 25 30 51 40 NA 43 21
 [76] 54 67 43 54 41 34 34  6 39 36 58 27 47 NA 50 61 47 52 51 68 33 65 34 38 60
[101] 10 49 56 50 60 49 54 39 48 65 47 61 34 NA NA 58 26 44 17 68 57 66 44 NA 67
[126] 48 62 35 53 53 66 55 57 47 58 43 45 44 63 59 44 53 51 28 65 63 76 61 33 48
[151] 42 36 55 20 26 50 47 74 50 31 45 51 66 76 47 48 56 70 46 43 41 41 19 49 43
[176] 43 75 52 42 37 45 35 67 38 44 45 39 46 NA 42 60 31 45 38 NA 19 69 66 NA 64
attr(,"label")
[1] "Age"

Mean:

mean(trial$age, na.rm = TRUE)
[1] 47.2381

Median:

median(trial$age, na.rm = TRUE)
[1] 47

Mode:

#install.packages("DescTools")
library(DescTools)
Mode(trial$age, na.rm = TRUE)
[1] 31 38 43 47 48
attr(,"freq")
[1] 7

Numerical variables

  • The most common measures of variability are:

    • Standard deviation (SD): the typical distance of the observations from the mean, computed as the square root of the average squared deviation (with n - 1 in the denominator).

    \[ \text{SD} = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n - 1}} \]

    • Range: difference between the maximum and minimum values.

    • Interquartile range (IQR): difference between the 75th percentile (Q3) and the 25th percentile (Q1).

The p-th percentile is the value at or below which p% of the observations fall. The 25th, 50th and 75th percentiles are called quartiles and divide the sorted values into four equal parts. The second quartile (Q2, the 50th percentile) is the median.
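
To connect the formulas to R, here is a minimal sketch on a made-up vector. It computes the SD from its definition and compares it with sd(), then derives the range and the interquartile range. Note that range() returns the minimum and maximum rather than their difference.

```r
# Small illustrative vector (not the trial data)
x <- c(2, 3, 3, 5, 7, 10)
n <- length(x)

# Standard deviation from its definition, and with the built-in function
sd_manual  <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_builtin <- sd(x)

# Range as a single number: maximum minus minimum
range_x <- diff(range(x))

# Interquartile range: Q3 - Q1, by hand and with IQR()
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(x)

c(sd = sd_manual, range = range_x, iqr = iqr_manual)
```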

Standard Deviation:

sd(trial$age, na.rm = TRUE)
[1] 14.31193

Range (range() returns the minimum and maximum; their difference, 83 - 6 = 77, is the range):

range(trial$age, na.rm = TRUE)
[1]  6 83

Percentiles:

#Summary of percentiles
quantile(trial$age, na.rm = TRUE)
  0%  25%  50%  75% 100% 
   6   38   47   57   83 

#To calculate a given percentile (90th)
quantile(trial$age, probs = 0.9, na.rm = TRUE)
 90% 
66.2 

Interquartile range:

#Interquartile range: Q3 - Q1
quantile(trial$age, probs = 0.75, na.rm = TRUE) - quantile(trial$age, probs = 0.25, na.rm = TRUE)
75% 
 19 

The summary() function directly returns a number of different descriptive statistics:

summary(trial$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   6.00   38.00   47.00   47.24   57.00   83.00      11 

Numerical variables

  • In a normal distribution, the mean (\(\bar{x}\)) and standard deviation (SD) fully characterize the distribution:
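
One way to see why the mean and SD are enough: under a normal distribution, the share of observations within k SDs of the mean is fixed, whatever the mean and SD are. A quick check with pnorm() (the helper name within_sd is ours, introduced for illustration):

```r
# Proportion of a normal distribution lying within k SDs of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

p1 <- within_sd(1)  # about 0.68
p2 <- within_sd(2)  # about 0.95
p3 <- within_sd(3)  # about 0.997

round(c(p1, p2, p3), 3)
```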

Numerical variables

  • Let’s show these statistics alongside the distribution of age:

[Figure: distribution of age annotated with these statistics]
Numerical variables

  • Let’s show these statistics for a non-normal distribution (marker level):

[Figure: distribution of marker level annotated with these statistics; Mean - 2*SD = -0.803]

  • Because a marker level cannot be negative, a lower limit of -0.803 shows that the mean and SD describe this skewed distribution poorly.
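
The value of Mean - 2*SD for the marker level can be reproduced with a few lines, assuming the trial dataset from {gtsummary} is available. The comparison with the quartiles is our addition, meant only to show that they stay on the scale of the data.

```r
library(gtsummary)  # provides the trial dataset

m <- mean(trial$marker, na.rm = TRUE)
s <- sd(trial$marker, na.rm = TRUE)

# Lower limit of the "mean +/- 2 SD" interval: negative for this skewed variable
lower <- m - 2 * s
lower

# Median and quartiles of the same variable
quantile(trial$marker, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
```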

Numerical variables

  • The choice of these measures depends on the distribution of the variable:

    Normally distributed variables are usually described with the mean and standard deviation.

    Non-normal variables are usually described with the median and the first and third quartiles (Q1 and Q3) or the interquartile range.

Characteristic                            N = 200
Age, Mean (SD)                            47 (14)
Marker Level (ng/mL), Median (Q1, Q3)     0.64 (0.22, 1.41)

  • The same summary by treatment group:

Characteristic                            Drug A, N = 98        Drug B, N = 102
Age, Mean (SD)                            47 (15)               47 (14)
Marker Level (ng/mL), Median (Q1, Q3)     0.84 (0.23, 1.60)     0.52 (0.18, 1.21)

gtsummary

{gtsummary} package

  • R package that provides an elegant and flexible way to create publication-ready analytical and summary tables.

  • It can summarize data sets, regression models, and more.

  • The tbl_summary() function calculates descriptive statistics for continuous and categorical columns of a data frame or tibble.

Sjoberg DD, Whiting K, Curry M, Lavery JA, Larmarange J. Reproducible summary tables with the gtsummary package. The R Journal 2021;13:570–80. https://doi.org/10.32614/RJ-2021-053.

tbl_summary()

  • Four types of summaries: continuous, continuous2, categorical, and dichotomous

  • Variables coded 0/1, FALSE/TRUE, No/Yes treated as dichotomous

  • Statistics are median (p25, p75) for continuous, n (%) for categorical/dichotomous

  • NA values will be shown as “Unknown”

  • Label attributes are printed automatically

library(gtsummary)
trial |> 
  tbl_summary(include = c("age", "marker", "grade", "response"))
Characteristic              N = 200¹
Age                         47 (38, 57)
    Unknown                 11
Marker Level (ng/mL)        0.64 (0.22, 1.41)
    Unknown                 10
Grade
    I                       68 (34%)
    II                      68 (34%)
    III                     64 (32%)
Tumor Response              61 (32%)
    Unknown                 7
¹ Median (Q1, Q3); n (%)

Customize the table

  • Stratify the table by treatment with the by argument:
trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Age                         46 (37, 60)           48 (39, 56)
    Unknown                 7                     4
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
    Unknown                 6                     4
Grade
    I                       35 (36%)              33 (32%)
    II                      32 (33%)              36 (35%)
    III                     31 (32%)              33 (32%)
Tumor Response              28 (29%)              33 (34%)
    Unknown                 3                     4
¹ Median (Q1, Q3); n (%)

  • Hide the rows with missing-value counts using missing = "no":

trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt,
    missing = "no"
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Age                         46 (37, 60)           48 (39, 56)
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
Grade
    I                       35 (36%)              33 (32%)
    II                      32 (33%)              36 (35%)
    III                     31 (32%)              33 (32%)
Tumor Response              28 (29%)              33 (34%)
¹ Median (Q1, Q3); n (%)

  • Report row percentages instead of column percentages using percent = "row":

trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt,
    missing = "no",
    percent = "row"
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Age                         46 (37, 60)           48 (39, 56)
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
Grade
    I                       35 (51%)              33 (49%)
    II                      32 (47%)              36 (53%)
    III                     31 (48%)              33 (52%)
Tumor Response              28 (46%)              33 (54%)
¹ Median (Q1, Q3); n (%)

  • Change the label of a variable with the label argument:

trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt,
    missing = "no",
    percent = "row",
    label = age ~ "Patient Age"
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Patient Age                 46 (37, 60)           48 (39, 56)
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
Grade
    I                       35 (51%)              33 (49%)
    II                      32 (47%)              36 (53%)
    III                     31 (48%)              33 (52%)
Tumor Response              28 (46%)              33 (54%)
¹ Median (Q1, Q3); n (%)

  • Change the statistics reported for a variable with the statistic argument:

trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt,
    missing = "no",
    percent = "row",
    label = age ~ "Patient Age",
    statistic = age ~ "{mean} ({sd})"
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Patient Age                 47 (15)               47 (14)
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
Grade
    I                       35 (51%)              33 (49%)
    II                      32 (47%)              36 (53%)
    III                     31 (48%)              33 (52%)
Tumor Response              28 (46%)              33 (54%)
¹ Mean (SD); Median (Q1, Q3); n (%)

  • Override the summary type with the type argument (here, to show both levels of response):

trial |> 
  tbl_summary(
    include = c("age", "marker", "grade", "response"),
    by = trt,
    missing = "no",
    percent = "row",
    label = age ~ "Patient Age",
    statistic = age ~ "{mean} ({sd})",
    type = response ~ "categorical"
  )
Characteristic              Drug A, N = 98¹       Drug B, N = 102¹
Patient Age                 47 (15)               47 (14)
Marker Level (ng/mL)        0.84 (0.23, 1.60)     0.52 (0.18, 1.21)
Grade
    I                       35 (51%)              33 (49%)
    II                      32 (47%)              36 (53%)
    III                     31 (48%)              33 (52%)
Tumor Response
    0                       67 (51%)              65 (49%)
    1                       28 (46%)              33 (54%)
¹ Mean (SD); Median (Q1, Q3); n (%)
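
The continuous2 summary type listed earlier is not used in the tables above. As a hedged sketch of what it is for, the call below puts several statistics for each continuous variable on separate rows; the choice of variables and statistics is ours, for illustration only.

```r
library(gtsummary)

tbl2 <- trial |>
  tbl_summary(
    include = c(age, marker),
    by = trt,
    missing = "no",
    # Multi-row summaries for all continuous variables
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c(
      "{mean} ({sd})",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
    )
  )

tbl2
```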

gtsummary syntax

  • To specify multiple instructions we have to use lists
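
A minimal sketch of the list syntax, assuming the trial data used above: an argument such as label or statistic accepts a list of formulas, one per variable or selector. The label text and the statistic templates are illustrative choices.

```r
library(gtsummary)

tbl <- trial |>
  tbl_summary(
    include = c(age, marker, grade),
    by = trt,
    # One formula per variable
    label = list(
      age ~ "Patient Age",
      marker ~ "Marker (ng/mL)"
    ),
    # One formula per group of variables, chosen with selector helpers
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    )
  )

tbl
```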

Useful resources for {gtsummary}

Descriptive graphics

Categorical variables

  • Can be graphically described with a bar plot showing the number of units in each category.

Warning

Pie charts are not a good way of describing categorical data because they become difficult to read as the number of categories increases.

Numerical variables

  • The objective is usually to visualize the shape of a variable’s distribution.

  • The most common graphical representations of the distribution of a numerical variable are the histogram, the density plot and the box plot:

    • Histograms are constructed by binning the data and counting the number of observations in each bin.

    • A density plot can be thought of as a smoothed histogram that estimates the probability density function of a continuous random variable.

    • A box plot describes the distribution of a numeric variable by showing its percentiles.
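
To make the binning step of a histogram concrete, here is a minimal base R sketch that counts observations per bin with cut() and table(). The ten ages are the first non-missing values of trial$age printed earlier; the bin width of 10 years is an arbitrary choice for illustration.

```r
# First ten non-missing ages from the trial data shown earlier
age <- c(23, 9, 31, 51, 39, 37, 32, 31, 34, 42)

# Assign each age to a bin of width 10: [0,10), [10,20), ..., [50,60)
bins <- cut(age, breaks = seq(0, 60, by = 10), right = FALSE)

# Bar heights of the histogram: number of observations in each bin
counts <- table(bins)
counts
```

A different bin width or starting point gives a different picture of the same data, which is why the bin width is usually set explicitly.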

ggplot2

The {ggplot2} package

  • It is one of the most widely used R packages for data visualization.

  • Unlike many graphics packages, ggplot2 uses a conceptual framework based on the grammar of graphics.

  • It is part of the tidyverse, but its layers are combined with + rather than with a pipe operator (|> or %>%).
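
As a small illustration of the layered approach, a plot can be stored as an object and extended with +. The data frame df below is made up for the example and is not the trial data.

```r
library(ggplot2)

# Hypothetical data: T stage of six patients
df <- data.frame(stage = c("T1", "T2", "T2", "T3", "T3", "T3"))

# Data and aesthetic mapping only: this draws an empty panel with axes
p <- ggplot(data = df, aes(x = stage))

# Adding a geometry layer with + produces the bar plot
p_bar <- p + geom_bar()
p_bar
```

Because each + returns a new plot object, intermediate steps such as p can be reused with different layers.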

Bar plot

  • Let’s create a bar plot using ggplot2:

library(ggplot2)

#Define data and mapping:
ggplot(data = trial, aes(x = stage))

  • Add a layer with geom_bar() to draw the bars:

#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
  #Create bar plot layer:
  geom_bar()

  • Set the fill color, outline color and width of the bars:

#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
  #Create bar plot layer:
  geom_bar(fill = "grey", color = "black", width = .8)

  • Set the axis titles, limits and breaks with the scale functions:

#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
  #Create bar plot layer:
  geom_bar(fill = "grey", color = "black", width = .8) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10))

  • Apply a theme:

#Define data and mapping:
ggplot(data = trial, aes(x = stage)) +
  #Create bar plot layer:
  geom_bar(fill = "grey", color = "black", width = .8) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  #Apply a black & white theme:
  theme_bw()

  • Map the fill color to treatment to split each bar by group:

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  #Apply a black & white theme:
  theme_bw()

  • Rename the fill legend:

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()

  • Place the bars of each treatment side by side with position_dodge():

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8, position = position_dodge()) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 60), breaks = seq(0, 60, by = 10)) +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()

  • Adjust the y-axis limits and breaks to the new bar heights:

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8, position = position_dodge()) +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Counts", limits = c(0, 30), breaks = seq(0, 30, by = 5)) +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()

  • Show the proportion of each treatment within a stage, instead of counts, with position = "fill":

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8, position = "fill") +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Percentage") +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()

  • Format the y-axis labels as percentages:

#Define data and mapping:
ggplot(data = trial, aes(x = stage, fill = trt)) +
  #Create bar plot layer:
  geom_bar(color = "black", width = .8, position = "fill") +
  #Change x scale:
  scale_x_discrete(name = "T Stage") +
  #Change y scale:
  scale_y_continuous(name = "Percentage", labels = scales::percent) +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()
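
With position = "fill", each bar is rescaled to a height of 1, so the segments show the share of each treatment within a stage. As a hedged sketch with made-up data (not the trial dataset), the same proportions can be computed with prop.table():

```r
# Hypothetical treatment arm and T stage for 10 patients (illustrative only)
trt   <- c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B")
stage <- c("T1", "T2", "T1", "T3", "T1", "T2", "T2", "T3", "T3", "T3")

# Row proportions: share of each arm within a stage (each row sums to 1)
fill_prop <- prop.table(table(stage, trt), margin = 1)
round(fill_prop, 2)
```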

Histogram plot

  • Now let’s create a histogram plot using ggplot2:

library(ggplot2)

#Define data and mapping:
ggplot(data = trial, aes(x = age))

  • Add a layer with geom_histogram() (by default the data are split into 30 bins):

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create histogram plot layer:
  geom_histogram()

  • Set the bin width and the colors:

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create histogram plot layer:
  geom_histogram(binwidth = 5, fill = "grey", color = "black")

  • Set the axis titles:

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create histogram plot layer:
  geom_histogram(binwidth = 5, fill = "grey", color = "black") +
  #Change x scale:
  scale_x_continuous(name = "Age") +
  #Change y scale:
  scale_y_continuous(name = "Counts")

  • Apply a theme:
#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create histogram plot layer:
  geom_histogram(binwidth = 5, fill = "grey", color = "black") +
  #Change x scale:
  scale_x_continuous(name = "Age") +
  #Change y scale:
  scale_y_continuous(name = "Counts") +
  #Apply a black & white theme:
  theme_bw()

  • Map the fill color to treatment:
#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
  #Create histogram plot layer:
  geom_histogram(binwidth = 5, color = "black") +
  #Change x scale:
  scale_x_continuous(name = "Age") +
  #Change y scale:
  scale_y_continuous(name = "Counts") +
  #Apply a black & white theme:
  theme_bw()

  • Rename the fill legend:
#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
  #Create histogram plot layer:
  geom_histogram(binwidth = 5, color = "black") +
  #Change x scale:
  scale_x_continuous(name = "Age") +
  #Change y scale:
  scale_y_continuous(name = "Counts") +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()

Density plot

  • Now let’s create a density plot using ggplot2:
library(ggplot2)

#Define data and mapping:
ggplot(data = trial, aes(x = age))

  • Add a layer with geom_density():

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create density plot layer:
  geom_density()

  • Set the fill color and its transparency:

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create density plot layer:
  geom_density(alpha = .6, fill = "#68abb8")

  • Set the axis titles, limits and breaks:

#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create density plot layer:
  geom_density(alpha = .6, fill = "#68abb8") +
  #Change x scale:
  scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
  #Change y scale:
  scale_y_continuous(name = "Density")

  • Apply a theme:
#Define data and mapping:
ggplot(data = trial, aes(x = age)) +
  #Create density plot layer:
  geom_density(alpha = .6, fill = "#68abb8") +
  #Change x scale:
  scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
  #Change y scale:
  scale_y_continuous(name = "Density") +
  #Apply a black & white theme:
  theme_bw()

  • Map the fill color to treatment to compare the groups:
#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
  #Create density plot layer:
  geom_density(alpha = .6) +
  #Change x scale:
  scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
  #Change y scale:
  scale_y_continuous(name = "Density") +
  #Apply a black & white theme:
  theme_bw()

  • Rename the fill legend:
#Define data and mapping:
ggplot(data = trial, aes(x = age, fill = trt)) +
  #Create density plot layer:
  geom_density(alpha = .6) +
  #Change x scale:
  scale_x_continuous(name = "Age", limits = c(0, 90), breaks = c(0, 25, 50, 75)) +
  #Change y scale:
  scale_y_continuous(name = "Density") +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw()
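
geom_density() draws a kernel density estimate. As a hedged illustration of what the y axis means, base R's density() can be used to check that the area under the estimated curve is approximately 1. The simulated ages below are an assumption made for the example, not the trial data.

```r
set.seed(1)

# Simulated ages (illustrative only)
x <- rnorm(200, mean = 47, sd = 14)

# Kernel density estimate evaluated on an equally spaced grid
d <- density(x)

# Approximate the area under the curve: sum of heights times grid spacing
area <- sum(d$y) * diff(d$x[1:2])
area
```

This is why the heights of a density plot are not counts: they are scaled so that the total area, not the total height, equals 1.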

Box plot

  • Now let’s create a box plot using ggplot2:
library(ggplot2)

#Define data and mapping:
ggplot(data = trial, aes(y = age))

  • Add a layer with geom_boxplot():

#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
  #Create box plot layer:
  geom_boxplot()

  • Set the fill color and its transparency:

#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6, fill = "#68abb8")

Box plot

  • Now let’s create a box plot using ggplot2:
library(ggplot2)

#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6, fill = "#68abb8") +
  #Change x scale:
  scale_x_continuous(limits = c(-1, 1)) +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90))

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6, fill = "#68abb8") +
  #Change x scale:
  scale_x_continuous(limits = c(-1, 1)) +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Apply a black & white theme:
  theme_bw()

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(y = age)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6, fill = "#68abb8") +
  #Change x scale:
  scale_x_continuous(limits = c(-1, 1)) +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Apply a black & white theme:
  theme_bw() +
  #Apply another theme to remove the x axis ticks and labels
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6, fill = "#68abb8") +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Apply a black & white theme:
  theme_bw() 

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age, fill = trt)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6) +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Apply a black & white theme:
  theme_bw() 

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age, fill = trt)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6) +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Change fill legend:
  scale_fill_discrete(name = "Treatment") +
  #Apply a black & white theme:
  theme_bw() 

Box plot

  • Now let’s create a box plot using ggplot2:
#Define data and mapping:
ggplot(data = trial, aes(x = trt, y = age, fill = trt)) +
  #Create box plot layer:
  geom_boxplot(alpha = .6) +
  #Change x scale:
  scale_x_discrete(name = "Treatment") +
  #Change y scale:
  scale_y_continuous(name = "Age", limits = c(0, 90)) +
  #Apply a black & white theme:
  theme_bw() +
  #Remove legend:
  theme(legend.position="none")

Useful resources for {ggplot2}

Comparison between groups

Introduction

  • For example:

    • A researcher may compare the performance of students in two different schools, or compare the performance of students in two different grade levels.

    • A researcher may compare the same group of people before and after taking a medication or compare the productivity of employees before and after a training program.

Type of samples

  • Independent samples:

    • Independent samples are selected randomly, so that the observations in one sample do not depend on the values of the observations in the other.

    • For example, a group of men and a group of women are each asked about their health status.


  • Paired samples:

    • In paired (dependent) samples, each measurement in one sample is linked to a specific measurement in the other.

    • For example, a sample of patients who have taken a painkiller are asked about their pain before and after taking the medicine.

Independent Samples

Independent Samples - Continuous Data

  • t-Test

When to use

  • The samples come from normally distributed populations.

  • If the populations have unequal variances, the Welch modification is used.

  • Wilcoxon rank-sum test

When to use

  • Normality is questionable or sample sizes are small.

Independent Samples - Continuous Data

  • How do you check for normality? (e.g., QQ-plots)
ggplot(trial, aes(sample = age)) +
  geom_qq() +
  geom_qq_line() +
  labs(x = "Theoretical Quantiles", 
       y = "Sample Quantiles") +
  theme_bw()
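
  • A possible complement to the QQ-plot above, sketched under the assumption that normality should be judged within each treatment group rather than on the pooled ages. The per-group facets and the Shapiro-Wilk test are illustrative choices that do not appear in the original slides:

```r
library(gtsummary)  # provides the trial dataset
library(ggplot2)

# QQ-plot of age drawn separately for each treatment group
qq_by_trt <- ggplot(trial, aes(sample = age)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ trt) +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_bw()
qq_by_trt

# Shapiro-Wilk test per group (shapiro.test() drops missing ages itself)
shapiro_by_trt <- tapply(trial$age, trial$trt, shapiro.test)
p_values <- sapply(shapiro_by_trt, function(res) res$p.value)
p_values
```

With samples of this size, a formal test can flag departures that barely matter in practice, so the plot is usually the more informative of the two.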

t-Test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): The means are equal across groups.

    • Alternative Hypothesis (H₁): The means are different across groups.

  • Formula: equal variance vs unequal variance

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

\[ t_\text{Welch} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}}} \]

\(\bar{x}_1, \bar{x}_2 = \text{Sample means}\)

\(s_1, s_2 = \text{Sample standard deviations}\)

\(s_p = \text{Pooled standard deviation}\)

\(n_1, n_2 = \text{Sample sizes}\)
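
  • To make the two formulas concrete, here is a minimal sketch that computes both statistics by hand on the trial data and checks them against t.test(). The pooled standard deviation uses the standard definition \(s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\), which the slide does not spell out; the variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Ages per treatment group, dropping missing values
x1 <- trial$age[trial$trt == "Drug A" & !is.na(trial$age)]
x2 <- trial$age[trial$trt == "Drug B" & !is.na(trial$age)]
n1 <- length(x1)
n2 <- length(x2)

# Equal-variance statistic, based on the pooled standard deviation
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
t_pooled <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))

# Welch statistic, based on the separate group variances
t_welch <- (mean(x1) - mean(x2)) / sqrt(var(x1) / n1 + var(x2) / n2)

# Reference values from t.test()
t_pooled_r <- unname(t.test(age ~ trt, data = trial, var.equal = TRUE)$statistic)
t_welch_r  <- unname(t.test(age ~ trt, data = trial)$statistic)

all.equal(t_pooled, t_pooled_r)
all.equal(t_welch, t_welch_r)
```

Note that t.test() applies the Welch version unless var.equal = TRUE is set.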

t-Test

  • Example: Comparing the average age between the two treatment groups.
t.test(age ~ trt, data = trial)

    Welch Two Sample t-test

data:  age by trt
t = -0.2093, df = 184.19, p-value = 0.8344
alternative hypothesis: true difference in means between group Drug A and group Drug B is not equal to 0
95 percent confidence interval:
 -4.566621  3.690640
sample estimates:
mean in group Drug A mean in group Drug B 
            47.01099             47.44898 

t-Test

  • If you want to test whether the average age in the Drug A group is less than the average age in the Drug B group:
t.test(age ~ trt, data = trial, alternative = "less")


  • Or, if you want to test whether the average age in the Drug A group is greater than the average age in the Drug B group:
t.test(age ~ trt, data = trial, alternative = "greater")

t-Test

  • Using gtsummary:
trial |> 
  select(age, trt) |> 
  tbl_summary(by = trt,
              statistic = age ~ "{mean} ({sd})") |> 
  add_p(test = age ~ "t.test")
Characteristic   Drug A, N = 98¹   Drug B, N = 102¹   p-value²
Age              47 (15)           47 (14)            0.8
    Unknown      7                 4
¹ Mean (SD)
² Welch Two Sample t-test


Use test = age ~ "t.test" to apply an independent t-test to compare age means between treatment groups.

Wilcoxon rank-sum test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): The distributions of both groups are equal.

    • Alternative Hypothesis (H₁): The distributions of both groups are different.

  • Formula:

\[W = R_1 - \frac{n_1(n_1 +1)}{2}\]

\(R_1 = \text{Sum of ranks for the reference group}\)

\(n_1 = \text{Number of observations in the reference group}\)
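
  • The formula can be traced step by step in R. This is a minimal sketch that assumes Drug A is the reference group (it is the first group in the formula interface); the variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Ages per treatment group, dropping missing values
x <- trial$age[trial$trt == "Drug A" & !is.na(trial$age)]  # reference group
y <- trial$age[trial$trt == "Drug B" & !is.na(trial$age)]
n1 <- length(x)

# Rank all observations together (ties receive average ranks)
ranks <- rank(c(x, y))

# Sum of ranks in the reference group, then subtract its minimum possible value
R1 <- sum(ranks[seq_len(n1)])
W <- R1 - n1 * (n1 + 1) / 2

# Reference value from wilcox.test()
W_r <- unname(wilcox.test(age ~ trt, data = trial)$statistic)
all.equal(W, W_r)
```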

Wilcoxon rank-sum test

  • Example: Comparing age distribution between the two treatment groups.
wilcox.test(age ~ trt, data = trial)

    Wilcoxon rank sum test with continuity correction

data:  age by trt
W = 4323, p-value = 0.7183
alternative hypothesis: true location shift is not equal to 0

Wilcoxon rank-sum test

  • Example: Comparing age distribution between the two treatment groups.
wilcox.test(age ~ trt, data = trial, conf.int = TRUE)

    Wilcoxon rank sum test with continuity correction

data:  age by trt
W = 4323, p-value = 0.7183
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -4.999980  3.999954
sample estimates:
difference in location 
            -0.9999612 

Wilcoxon rank-sum test

  • Using gtsummary:
trial |> 
  select(age, marker, trt) |> 
  tbl_summary(by = "trt") |> 
  add_p(test = age ~ "wilcox.test")
Characteristic          Drug A, N = 98¹      Drug B, N = 102¹     p-value²
Age                     46 (37, 60)          48 (39, 56)          0.7
    Unknown             7                    4
Marker Level (ng/mL)    0.84 (0.23, 1.60)    0.52 (0.18, 1.21)    0.085
    Unknown             6                    4
¹ Median (Q1, Q3)
² Wilcoxon rank sum test


By default, add_p() uses the Wilcoxon rank-sum test for continuous variables when two groups are compared.

Independent Samples - Categorical Data

  • Chi-square test

When to use

  • Compares observed frequencies to expected frequencies.

  • Appropriate when sample sizes are large (expected cell counts are ≥ 5).

  • Fisher’s exact test

When to use

  • Ideal for small sample sizes or when expected cell counts are less than 5.

Independent Samples - Categorical Data


\[E = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}} = \frac{28 \times 33}{48} = 19.25\]

Chi-Square Test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): There is no association between the categorical variables.

    • Alternative Hypothesis (H₁): An association exists between the categorical variables.

  • Formula:

\[\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}\]

\(O_i = \text{Observed frequency}\)

\(E_i = \text{Expected frequency}\)
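
  • As a sketch of how the statistic is assembled, the expected frequencies and the chi-square value can be computed by hand for the stage by treatment table and compared with chisq.test(). The variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Observed frequencies: T stage (rows) by treatment (columns)
observed <- table(trial$stage, trial$trt)

# Expected frequencies: row total x column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Reference values from chisq.test()
fit <- chisq.test(observed)
all.equal(chi_sq, unname(fit$statistic))
all.equal(p_value, fit$p.value)
```

For 2 x 2 tables chisq.test() applies a continuity correction by default, so the hand calculation would only match with correct = FALSE.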

Chi-Square Test

  • Example: Testing the association between treatment group (trt) and cancer stage (stage).
chisq.test(trial$stage, trial$trt)

    Pearson's Chi-squared test

data:  trial$stage and trial$trt
X-squared = 0.72966, df = 3, p-value = 0.8662

Chi-Square Test

  • Using gtsummary:
trial |> 
  select(stage, trt) |> 
  tbl_summary(by = "trt") |> 
  add_p(test = stage ~ "chisq.test")
Characteristic   Drug A, N = 98¹   Drug B, N = 102¹   p-value²
T Stage                                               0.9
    T1           28 (29%)          25 (25%)
    T2           25 (26%)          29 (28%)
    T3           22 (22%)          21 (21%)
    T4           23 (23%)          27 (26%)
¹ n (%)
² Pearson’s Chi-squared test


In add_p(), the default test for categorical data is the chi-square test.

If any expected cell count is low (<5), Fisher’s exact test is used instead.
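
  • The same decision rule can be applied by hand. This is a rough sketch of the rule as stated above, not gtsummary's internal code; the variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Contingency table: T stage (rows) by treatment (columns)
tab <- table(trial$stage, trial$trt)

# Expected cell counts under independence
expected <- suppressWarnings(chisq.test(tab))$expected
min(expected)

# Chi-square test if all expected counts are at least 5, Fisher's exact test otherwise
test_result <- if (all(expected >= 5)) chisq.test(tab) else fisher.test(tab)
test_result$method
test_result$p.value
```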

Paired Samples

Paired Samples - Continuous Data

  • Paired t-Test

When to use

  • The differences between paired observations are normally distributed.

  • Wilcoxon Signed-Rank Test

When to use

  • Normality of the paired differences is questionable or sample sizes are small.

Paired t-Test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): Mean difference between the paired samples is zero.

    • Alternative Hypothesis (H₁): Mean difference between the paired samples is not equal to zero.

  • Formula:

\[t = \frac{\bar{d}}{\frac{s_d}{\sqrt{n}}}\]

\(\bar{d} = \text{Mean of differences}\)

\(s_d = \text{Standard deviation of differences}\)

\(n = \text{Number of pairs}\)
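
  • A minimal sketch of the formula on the built-in sleep data, assuming the rows of both groups are ordered by subject ID (as they are in this dataset). The variable names are illustrative:

```r
# Paired measurements from the built-in sleep dataset
pre  <- sleep$extra[sleep$group == 1]
post <- sleep$extra[sleep$group == 2]

# Within-subject differences
d <- pre - post
n <- length(d)

# t statistic: mean difference divided by its standard error
t_stat <- mean(d) / (sd(d) / sqrt(n))

# Two-sided p-value with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Reference values from t.test()
fit <- t.test(pre, post, paired = TRUE)
all.equal(t_stat, unname(fit$statistic))
all.equal(p_value, fit$p.value)
```

A paired t-test is therefore the same as a one-sample t-test on the differences: t.test(d) gives the same statistic.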

Paired t-Test

  • Example: Comparing pre-treatment and post-treatment increase in sleep hours.
# Loading packages and data
library(dplyr)  # provides filter(), pull(), mutate() and case_when()
data("sleep", package = "datasets")
extra group ID
0.7 1 1
1.9 2 1
-1.6 1 2
0.8 2 2
# Creating the objects pre & post intervention
pre_interv <- sleep |> filter(group == 1) |> pull(extra)

post_interv <- sleep |> filter(group == 2) |> pull(extra)

# Paired t-test
t.test(pre_interv, post_interv, paired = TRUE)

Paired t-Test

  • Example: Comparing pre-treatment and post-treatment increase in sleep hours.
t.test(pre_interv, post_interv, paired = TRUE)

    Paired t-test

data:  pre_interv and post_interv
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean difference 
          -1.58 

Paired t-Test

  • Using gtsummary:
sleep |> 
  tbl_summary(by = group, 
              include = extra,
              statistic = extra ~ "{mean} ({sd})") |> 
  add_p(test = extra ~ "paired.t.test",
        group = ID)
Characteristic   1, N = 10¹     2, N = 10¹     p-value²
extra            0.75 (1.79)    2.33 (2.00)    0.003
¹ Mean (SD)
² Paired t-test

Wilcoxon Signed-Rank Test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): The differences between paired observations are symmetrically distributed around zero.

    • Alternative Hypothesis (H₁): The differences are not symmetrically distributed around zero.

  • Formula:

\[ W = \sum_{i=1}^{N_r} \left[ \operatorname{sgn}(x_{2,i} - x_{1,i}) \cdot R_i \right] \]

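\(x_{1,i}, x_{2,i} = \text{Paired observations for pair } i\)

\(R_i = \text{Rank of } |x_{2,i} - x_{1,i}| \text{ among the non-zero differences}\)

\(N_r = \text{Number of pairs with a non-zero difference}\)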
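
  • A sketch of the calculation on the sleep data, taking group 1 as \(x_1\) and group 2 as \(x_2\). Note that wilcox.test() does not report W itself: it reports V, the sum of the ranks of the positive differences (first sample minus second). The variable names are illustrative:

```r
# Paired measurements from the built-in sleep dataset
pre  <- sleep$extra[sleep$group == 1]  # x1
post <- sleep$extra[sleep$group == 2]  # x2

# Differences x2 - x1, discarding pairs with no change
d <- post - pre
d <- d[d != 0]
N_r <- length(d)

# Ranks of the absolute differences (ties receive average ranks)
R <- rank(abs(d))

# Signed-rank sum as in the formula
W <- sum(sign(d) * R)

# V as reported by wilcox.test(pre, post): ranks where pre - post > 0
V <- sum(R[d < 0])

# Reference value (warnings about ties and zeros are expected here)
V_r <- unname(suppressWarnings(wilcox.test(pre, post, paired = TRUE))$statistic)
all.equal(V, V_r)
```

Every non-zero difference points in the same direction in these data, which is why V is 0 and W reaches its maximum of N_r (N_r + 1) / 2.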

Wilcoxon Signed-Rank Test

  • Example: Comparing pre-treatment and post-treatment increase in sleep hours.
wilcox.test(pre_interv, post_interv, paired = TRUE)

    Wilcoxon signed rank test with continuity correction

data:  pre_interv and post_interv
V = 0, p-value = 0.009091
alternative hypothesis: true location shift is not equal to 0

Wilcoxon Signed-Rank Test

  • Example: Comparing pre-treatment and post-treatment increase in sleep hours.
wilcox.test(pre_interv, post_interv, paired = TRUE, conf.int = TRUE)

    Wilcoxon signed rank test with continuity correction

data:  pre_interv and post_interv
V = 0, p-value = 0.009091
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -2.949921 -1.050018
sample estimates:
(pseudo)median 
     -1.400031 

Wilcoxon Signed-Rank Test

  • Using gtsummary:
sleep |> 
  tbl_summary(by = group, 
              include = extra) |> 
  add_p(test = extra ~ "paired.wilcox.test",
        group = ID)
Characteristic   1, N = 10¹            2, N = 10¹           p-value²
extra            0.35 (-0.20, 2.00)    1.75 (0.80, 4.40)    0.009
¹ Median (Q1, Q3)
² Wilcoxon signed rank test with continuity correction

Paired Samples - Categorical Data

  • McNemar Test

When to use

  • Compare proportions or frequencies between two related groups.

  • Determine whether the two kinds of disagreement between the conditions (discordant pairs) occur equally often.

McNemar Test

  • Hypothesis Testing:

    • Null Hypothesis (H₀): The proportions of discordant pairs are equal.

    • Alternative Hypothesis (H₁): The proportions of discordant pairs differ.

  • Formula:

\[ \chi^2 = \frac{(b - c)^2}{b + c} \]

\(b, c = \text{Numbers of discordant pairs, one for each direction of change}\)
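
  • A sketch of the calculation on the sleep data, where b and c are the two counts of discordant pairs (pairs whose category changes between conditions). The dichotomisation at zero and the variable names are illustrative. By default mcnemar.test() applies a continuity correction, \((|b - c| - 1)^2 / (b + c)\), so both versions are shown:

```r
# Dichotomise the paired measurements from the built-in sleep dataset
to_cat <- function(x) {
  factor(ifelse(x >= 0, "Positive", "Negative"), levels = c("Negative", "Positive"))
}
pre_cat  <- to_cat(sleep$extra[sleep$group == 1])
post_cat <- to_cat(sleep$extra[sleep$group == 2])

# Paired 2 x 2 table: rows = first condition, columns = second condition
tab <- table(pre = pre_cat, post = post_cat)

# Discordant pairs
n_neg_pos <- tab["Negative", "Positive"]  # b
n_pos_neg <- tab["Positive", "Negative"]  # c

# Statistic without and with continuity correction
chi_plain     <- (n_neg_pos - n_pos_neg)^2 / (n_neg_pos + n_pos_neg)
chi_corrected <- (abs(n_neg_pos - n_pos_neg) - 1)^2 / (n_neg_pos + n_pos_neg)

# Reference values from mcnemar.test()
all.equal(chi_plain, unname(mcnemar.test(tab, correct = FALSE)$statistic))
all.equal(chi_corrected, unname(mcnemar.test(tab)$statistic))
```

Concordant pairs (the diagonal of the table) do not enter the statistic at all.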

McNemar Test

  • Example: Testing whether there is a difference in sleeping hours before and after treatment.
# Creating a categorical version of the variable `extra`
sleep <- sleep |> 
  mutate(extra_cat = case_when(extra >= 0 ~ "Positive",
                               extra < 0 ~ "Negative"),
         extra_cat = factor(extra_cat))

# Creating the objects pre & post intervention
pre_interv_cat <- sleep |> filter(group == 1) |> pull(extra_cat)

post_interv_cat <- sleep |> filter(group == 2) |> pull(extra_cat)

# McNemar Test
mcnemar.test(pre_interv_cat, post_interv_cat)

McNemar Test

  • Example: Testing whether there is a difference in sleeping hours before and after treatment.
mcnemar.test(pre_interv_cat, post_interv_cat)

    McNemar's Chi-squared test with continuity correction

data:  pre_interv_cat and post_interv_cat
McNemar's chi-squared = 1.3333, df = 1, p-value = 0.2482

McNemar Test

  • Using gtsummary:
sleep |> 
  tbl_summary(by = group, 
              include = extra_cat) |> 
  add_p(test = extra_cat ~ "mcnemar.test",
        group = ID)
Characteristic   1, N = 10¹   2, N = 10¹   p-value²
extra_cat                                  0.2
    Negative     4 (40%)      1 (10%)
    Positive     6 (60%)      9 (90%)
¹ n (%)
² McNemar’s Chi-squared test with continuity correction

Available tests in gtsummary

Link: https://www.danieldsjoberg.com/gtsummary/reference/tests.html

Effect size

Effect size

Recapping:

A statistically significant result does not indicate the size of the effect or its clinical relevance.

The clinical significance of a finding is determined by assessing whether the effect is large enough to influence medical practice or decision making.

  • With large samples, even trivial differences can be statistically significant.

  • With small samples, even large differences can be statistically non-significant.

Important

Report the effect size
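
  • A small numerical illustration of the two statements above. The mean differences, standard deviation and sample sizes are hypothetical values chosen for the example, and the helper p_from_summary() is not part of any package; it applies the equal-variance t-test formula to summary statistics with equal group sizes:

```r
# Two-sided p-value of an equal-variance two-sample t-test
# computed from summary statistics (equal group sizes assumed)
p_from_summary <- function(mean_diff, sd, n_per_group) {
  se <- sd * sqrt(2 / n_per_group)
  t_stat <- mean_diff / se
  2 * pt(-abs(t_stat), df = 2 * n_per_group - 2)
}

# Trivial difference (0.5 units, about 0.03 SD) with 100,000 subjects per group
p_trivial_large_n <- p_from_summary(mean_diff = 0.5, sd = 15, n_per_group = 100000)

# Large difference (10 units, about 0.67 SD) with 5 subjects per group
p_large_small_n <- p_from_summary(mean_diff = 10, sd = 15, n_per_group = 5)

c(trivial_large_n = p_trivial_large_n, large_small_n = p_large_small_n)
```

The first p-value falls far below 0.05 and the second stays above it, although the second difference is twenty times larger.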

Effect size - t-test

  • Using gtsummary:
trial |> 
  select(age, trt) |> 
  tbl_summary(by = trt,
              statistic = age ~ "{mean} ({sd})",
              digits = age ~ c(2, 2)) |> 
  add_difference()
Characteristic   Drug A, N = 98¹   Drug B, N = 102¹   Difference²   95% CI²      p-value²
Age              47.01 (14.71)     47.45 (14.01)      -0.44         -4.6, 3.7    0.8
    Unknown      7                 4

Abbreviation: CI = Confidence Interval
¹ Mean (SD)
² Welch Two Sample t-test
  • The difference is the raw difference in means between groups. Its confidence interval and the p-value come from the Welch t-test, t.test().

Effect size - Cohen’s d

  • Cohen’s d is a standardized effect size for measuring the difference between two group means:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}\]

\(\bar{x}_1, \bar{x}_2 = \text{Sample means}\)

\(s_p = \text{Pooled standard deviation}\)

The d statistic redefines the difference in means as the number of standard deviations that separates those means.
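
  • A minimal sketch of the formula applied by hand to the age variable of the trial data, using the standard pooled standard deviation. The variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Ages per treatment group, dropping missing values
x1 <- trial$age[trial$trt == "Drug A" & !is.na(trial$age)]
x2 <- trial$age[trial$trt == "Drug B" & !is.na(trial$age)]
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Cohen's d: difference in means expressed in pooled SD units
d <- (mean(x1) - mean(x2)) / s_p
round(d, 2)
```

The sign only reflects which group is subtracted from which; the magnitude is what is interpreted.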

Effect size - Cohen’s d

  • How to do it in R:
#install.packages("effectsize")
library(effectsize)
cohens_d(age ~ trt, data = trial)
Cohen's d |        95% CI
-------------------------
-0.03     | [-0.32, 0.25]

- Estimated using pooled SD.
  • Effect size interpretation (Cohen, J. 1988)
|d| value          Rough interpretation
0.2 ≤ |d| < 0.5    Small effect
0.5 ≤ |d| < 0.8    Moderate effect
|d| ≥ 0.8          Large effect

A difference smaller than 0.2 standard deviations is considered trivial, even if statistically significant.

Effect size - Cohen’s d

  • Using gtsummary:
trial |> 
  select(age, trt) |> 
  tbl_summary(by = trt,
              statistic = age ~ "{mean} ({sd})",
              digits = age ~ c(2, 2)) |> 
  add_difference(test = age ~ "cohens_d")
Characteristic   Drug A, N = 98¹   Drug B, N = 102¹   Difference²   95% CI²
Age              47.01 (14.71)     47.45 (14.01)      -0.03         -0.32, 0.25
    Unknown      7                 4

Abbreviation: CI = Confidence Interval
¹ Mean (SD)
² Cohen’s D

Effect size - Wilcoxon rank-sum test

  • Using gtsummary:
trial |> 
  select(age, trt) |> 
  tbl_summary(by = trt,
              digits = all_continuous() ~ c(2, 2)) |> 
  add_difference(test = age ~ "wilcox.test")
Characteristic   Drug A, N = 98¹         Drug B, N = 102¹        Difference²   95% CI²      p-value²
Age              46.00 (37.00, 60.00)    48.00 (39.00, 56.00)    -1.0          -5.0, 4.0    0.7
    Unknown      7                       4

Abbreviation: CI = Confidence Interval
¹ Median (Q1, Q3)
² Wilcoxon rank sum test
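  • The difference is the estimated location shift between groups: the median of the differences between an observation from Drug A and an observation from Drug B (the Hodges-Lehmann estimator). It is not the difference between the two medians. The confidence interval and the p-value come from the Wilcoxon rank sum test.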
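
  • A sketch of this estimator computed directly as the median of all pairwise differences. The value reported by wilcox.test(conf.int = TRUE) is obtained numerically under a normal approximation for samples of this size, so small discrepancies with the direct calculation are to be expected. The variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# Ages per treatment group, dropping missing values
x1 <- trial$age[trial$trt == "Drug A" & !is.na(trial$age)]
x2 <- trial$age[trial$trt == "Drug B" & !is.na(trial$age)]

# All pairwise differences (Drug A minus Drug B)
pairwise_diff <- outer(x1, x2, "-")

# Median of the pairwise differences
location_shift <- median(pairwise_diff)

# Difference between the two medians, shown for contrast
median_diff <- median(x1) - median(x2)

c(location_shift = location_shift, median_diff = median_diff)
```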

Effect size - Proportions

  • Using gtsummary:
trial |> 
  select(response, trt) |> 
  tbl_summary(by = trt) |> 
  add_difference(test = response ~ "prop.test")
Characteristic    Drug A, N = 98¹   Drug B, N = 102¹   Difference²   95% CI²       p-value²
Tumor Response    28 (29%)          33 (34%)           -4.2%         -18%, 9.9%    0.6
    Unknown       3                 4

Abbreviation: CI = Confidence Interval
¹ n (%)
² 2-sample test for equality of proportions with continuity correction
  • The difference reported is the difference in proportions between groups. The p-value is calculated using a proportion test.

Effect size - Proportions

  • Another way to report the difference in proportions is through relative risks.

gtsummary does not support calculating relative risks using the add_difference() function.

  • Calculating relative risks:
# install.packages("epitools")
library(epitools)

rr <- riskratio(table(trial$trt, trial$response))
rr$measure
        risk ratio with 95% C.I.
         estimate     lower    upper
  Drug A 1.000000        NA       NA
  Drug B 1.142493 0.7528554 1.733785

The incidence of tumour response is 14% higher in the group treated with drug B than in the group treated with drug A, although the 95% confidence interval (0.75 to 1.73) includes 1.
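
  • The same estimate can be recovered by hand from the 2 x 2 table. This is a minimal sketch using the usual normal approximation on the log scale for the confidence interval; the variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# 2 x 2 table: treatment (rows) by response (columns "0" and "1"); missing responses are dropped
tab <- table(trial$trt, trial$response)

# Risk of response in each group
events <- tab[, "1"]
totals <- rowSums(tab)
risk <- events / totals

# Relative risk of Drug B versus Drug A
rr_hand <- unname(risk["Drug B"] / risk["Drug A"])

# 95% confidence interval on the log scale
se_log_rr <- sqrt(sum(1 / events) - sum(1 / totals))
rr_ci <- exp(log(rr_hand) + c(-1, 1) * qnorm(0.975) * se_log_rr)

round(c(estimate = rr_hand, lower = rr_ci[1], upper = rr_ci[2]), 3)
```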

Effect size - Proportions

  • Another way to report the difference in proportions is by using the odds ratio.

gtsummary does not support calculating odds ratio using the add_difference() function.

  • Calculating odds ratio:
or <- oddsratio(table(trial$trt, trial$response))
or$measure
        odds ratio with 95% C.I.
         estimate     lower    upper
  Drug A 1.000000        NA       NA
  Drug B 1.212851 0.6588486 2.244154

The odds of tumour response are 21% higher in the group treated with drug B than in the group treated with drug A, although the 95% confidence interval (0.66 to 2.24) includes 1.
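
  • For comparison, a sketch of the classical cross-product odds ratio computed by hand. It is close to, but not identical with, the estimate above; this is presumably because oddsratio() reports a median-unbiased estimate by default rather than the cross-product one. The variable names are illustrative:

```r
library(gtsummary)  # provides the trial dataset

# 2 x 2 table: treatment (rows) by response (columns "0" and "1"); missing responses are dropped
tab <- table(trial$trt, trial$response)

# Odds of response in each group: responders / non-responders
odds <- tab[, "1"] / tab[, "0"]

# Cross-product odds ratio of Drug B versus Drug A
or_hand <- unname(odds["Drug B"] / odds["Drug A"])

# 95% confidence interval on the log scale (Wald)
se_log_or <- sqrt(sum(1 / tab))
or_ci <- exp(log(or_hand) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(estimate = or_hand, lower = or_ci[1], upper = or_ci[2]), 3)
```

Because the response is fairly common in both groups (around 30%), the odds ratio lies further from 1 than the relative risk, so the two measures should not be read interchangeably.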

Available tests in gtsummary

Link: https://www.danieldsjoberg.com/gtsummary/reference/tests.html

Summary

Handle with care
     …life is not so easy!


📅 Thanks & See You Tomorrow! 👋